Tree Methods Project

For this project we will explore the use of tree methods to classify schools as Private or Public based on their features.

Let's start by getting the data: the College data frame, which is included in the ISLR library.


A data frame with 777 observations on the following 18 variables.

  • Private A factor with levels No and Yes indicating private or public university
  • Apps Number of applications received
  • Accept Number of applications accepted
  • Enroll Number of new students enrolled
  • Top10perc Pct. new students from top 10% of H.S. class
  • Top25perc Pct. new students from top 25% of H.S. class
  • F.Undergrad Number of fulltime undergraduates
  • P.Undergrad Number of parttime undergraduates
  • Outstate Out-of-state tuition
  • Room.Board Room and board costs
  • Books Estimated book costs
  • Personal Estimated personal spending
  • PhD Pct. of faculty with Ph.D.’s
  • Terminal Pct. of faculty with terminal degree
  • S.F.Ratio Student/faculty ratio
  • perc.alumni Pct. alumni who donate
  • Expend Instructional expenditure per student
  • Grad.Rate Graduation rate

Get the Data

Load the ISLR library and check the head of College (a data frame that ships with ISLR; use data() to confirm this). Then assign College to a data frame called df.
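One possible way to do this is sketched below. It assumes the ISLR package is already installed (for example via install.packages("ISLR")).

```r
# Assumes the ISLR package is installed
library(ISLR)

# List the data sets shipped with ISLR; College should be among them
data(package = "ISLR")$results[, "Item"]

# Preview the first rows, then work on a copy named df
head(College)
df <- College
```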

In [1]:
In [2]:
Out[2]:
                             Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate
Abilene Christian University     Yes 1660   1232    721        23        52        2885         537     7440
Adelphi University               Yes 2186   1924    512        16        29        2683        1227    12280
Adrian College                   Yes 1428   1097    336        22        50        1036          99    11250
Agnes Scott College              Yes  417    349    137        60        89         510          63    12960
Alaska Pacific University        Yes  193    146     55        16        44         249         869     7560
Albertson College                Yes  587    479    158        38        62         678          41    13500

                             Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
Abilene Christian University       3300   450     2200  70       78      18.1          12   7041        60
Adelphi University                 6450   750     1500  29       30      12.2          16  10527        56
Adrian College                     3750   400     1165  53       66      12.9          30   8735        54
Agnes Scott College                5450   450      875  92       97       7.7          37  19016        59
Alaska Pacific University          4120   800     1500  76       72      11.9           2  10922        15
Albertson College                  3335   500      675  67       73       9.4          11   9727        55
In [3]:

EDA

Let's explore the data!

Create a scatterplot of Grad.Rate versus Room.Board, colored by the Private column.
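A minimal sketch of such a plot, assuming ggplot2 is the plotting library (the exercise does not name one) and that "Grad.Rate versus Room.Board" means Grad.Rate on the y axis:

```r
library(ISLR)
library(ggplot2)  # assumption: ggplot2 is used for plotting

df <- College

# Room.Board on the x axis, Grad.Rate on the y axis, one color per Private level
pl <- ggplot(df, aes(x = Room.Board, y = Grad.Rate)) +
  geom_point(aes(color = Private), alpha = 0.6)
print(pl)
```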

In [4]:
In [5]:

Create a histogram of full-time undergraduate students (F.Undergrad), colored by Private.
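One way this could look, again assuming ggplot2. The bin count of 50 is an arbitrary choice; the same pattern with x = Grad.Rate would cover the graduation-rate histogram.

```r
library(ISLR)
library(ggplot2)  # assumption: ggplot2 is used for plotting

df <- College

# Stacked histogram of full-time undergraduates, filled by Private
pl <- ggplot(df, aes(x = F.Undergrad)) +
  geom_histogram(aes(fill = Private), color = "black", bins = 50)
print(pl)
```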

In [6]:

Create a histogram of Grad.Rate colored by Private. You should see something odd here.

In [7]:

Which college had a graduation rate above 100%?
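A short sketch of how the offending row can be found with subset(); the variable name over is just illustrative.

```r
library(ISLR)

df <- College

# Keep only the rows whose graduation rate exceeds 100
over <- subset(df, Grad.Rate > 100)

# The row names of College are the school names
rownames(over)
over$Grad.Rate
```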

In [ ]:

Change that college's grad rate to 100%
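One way to apply the fix is to cap every value above 100 using logical indexing, which avoids hard-coding the school's name:

```r
library(ISLR)

df <- College

# Cap any graduation rate above 100 at exactly 100
df$Grad.Rate[df$Grad.Rate > 100] <- 100

# Confirm nothing exceeds 100 any more
max(df$Grad.Rate)
```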

In [9]:

Train Test Split

Split your data into training and testing sets 70/30. Use the caTools library to do this.
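A minimal sketch of the split using sample.split() from caTools. The seed value and the name split_flag are assumptions chosen for illustration; any seed works.

```r
library(ISLR)
library(caTools)

df <- College

set.seed(101)  # arbitrary seed so the split is reproducible

# TRUE for roughly 70% of rows, keeping the Yes/No ratio of Private
split_flag <- sample.split(df$Private, SplitRatio = 0.70)

train <- subset(df, split_flag)
test  <- subset(df, !split_flag)
```

Passing the label column (rather than the whole data frame) to sample.split() lets it preserve the class proportions in both sets.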

In [10]:

Decision Tree

Use the rpart library to build a decision tree to predict whether or not a school is Private. Remember to build your tree on the training data only.
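A sketch of fitting the tree, assuming all other columns are used as predictors. The split is repeated here so the block runs on its own; the seed and variable names are illustrative.

```r
library(ISLR)
library(caTools)
library(rpart)

df <- College
set.seed(101)  # arbitrary seed
split_flag <- sample.split(df$Private, SplitRatio = 0.70)
train <- subset(df, split_flag)

# method = "class" requests a classification tree
tree <- rpart(Private ~ ., method = "class", data = train)
print(tree)
```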

In [11]:
In [12]:

Use predict() to predict the Private label on the test data.

In [26]:

Check the head of the predicted values. You should notice that you actually have two columns, one with the probability of each class.

In [27]:
Out[27]:
                                                 No         Yes
Adrian College                          0.003311258 0.996688742
Alfred University                       0.003311258 0.996688742
Allegheny College                       0.003311258 0.996688742
Allentown Coll. of St. Francis de Sales 0.003311258 0.996688742
Alma College                            0.003311258 0.996688742
Amherst College                         0.003311258 0.996688742

Turn these two columns into a single column that matches the original Yes/No labels of the Private column.
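The conversion can be sketched on a small stand-in matrix. The first row repeats the probabilities shown above; the second row ("Hypothetical College") is made up for illustration, and the 0.5 cutoff is an assumption.

```r
# Toy probability matrix shaped like the predict() output above
probs <- matrix(c(0.003311258, 0.996688742,
                  0.800000000, 0.200000000),
                ncol = 2, byrow = TRUE,
                dimnames = list(c("Adrian College", "Hypothetical College"),
                                c("No", "Yes")))
probs <- as.data.frame(probs)

# Label a school "Yes" when its Yes probability is at least 0.5
probs$Private <- factor(ifelse(probs$Yes >= 0.5, "Yes", "No"),
                        levels = c("No", "Yes"))
print(probs)
```

Alternatively, calling predict() on an rpart classification tree with type = "class" returns the labels directly.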

In [32]:
In [33]:
In [ ]:

Now use table() to create a confusion matrix of your tree model.

In [37]:
Out[37]:
     
       No Yes
  No   57   9
  Yes   7 160

Use the rpart.plot library and the prp() function to plot out your tree model.
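A sketch of the plotting step, assuming the rpart.plot package is installed. The tree is refit here so the block is self-contained; the seed and names are illustrative.

```r
library(ISLR)
library(caTools)
library(rpart)
library(rpart.plot)

df <- College
set.seed(101)  # arbitrary seed
split_flag <- sample.split(df$Private, SplitRatio = 0.70)
train <- subset(df, split_flag)
tree <- rpart(Private ~ ., method = "class", data = train)

# Draw the fitted tree
prp(tree)
```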

In [38]:

Random Forest

Now let's build out a random forest model!

Load the randomForest package.

In [39]:

Now use randomForest() to build a model that predicts the Private class. Add importance=TRUE as a parameter in the model. (Use help(randomForest) to find out what this does.)
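A minimal sketch of the fit, using the package defaults for everything except importance. The seed and the split are repeated so the block runs on its own and are illustrative choices.

```r
library(ISLR)
library(caTools)
library(randomForest)

df <- College
set.seed(101)  # arbitrary seed; affects both the split and the forest
split_flag <- sample.split(df$Private, SplitRatio = 0.70)
train <- subset(df, split_flag)

# importance = TRUE asks the forest to assess predictor importance
model <- randomForest(Private ~ ., data = train, importance = TRUE)

# Confusion matrix stored on the fitted model
model$confusion
```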

In [42]:

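What was your model's confusion matrix on its own training data? Use model$confusion. (Note that randomForest computes this matrix from out-of-bag predictions, so it is not a plain resubstitution error.)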

In [47]:
Out[47]:
     No Yes class.error
No  128  20  0.1351351
Yes  11 385  0.02777778

Grab the feature importance with model$importance. Refer to the reading for more info on what Gini means.
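As a sketch, the importance matrix can be pulled out and sorted so the most influential predictors appear first. Sorting by MeanDecreaseGini is an illustrative choice, as are the seed and variable names.

```r
library(ISLR)
library(caTools)
library(randomForest)

df <- College
set.seed(101)  # arbitrary seed
split_flag <- sample.split(df$Private, SplitRatio = 0.70)
train <- subset(df, split_flag)
model <- randomForest(Private ~ ., data = train, importance = TRUE)

# One row per predictor; sort by decreasing MeanDecreaseGini
imp <- model$importance
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]
```

The randomForest package also provides varImpPlot(model) for a quick visual ranking.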

In [50]:
Out[50]:
                      No           Yes MeanDecreaseAccuracy MeanDecreaseGini
Apps          0.02728403    0.01548610           0.01855833      11.12145164
Accept        0.02758290    0.01376933           0.01753925      11.90902939
Enroll        0.03543096    0.02850524           0.03019823      21.78309626
Top10perc    0.012074927   0.004876602          0.006829200      5.878231682
Top25perc    0.005939966   0.004574398          0.004956138      4.636132072
F.Undergrad   0.14066600    0.06924742           0.08876258      37.37014532
P.Undergrad  0.048479660   0.005760397          0.017361364     15.537621126
Outstate      0.14299823    0.06438651           0.08555537      41.35682991
Room.Board    0.01640514    0.01315531           0.01407650      12.21199808
Books       0.0018960791 -0.0002561684         0.0003830364     2.2014363026
Personal     0.004212604   0.001688864          0.002336789      3.651182123
PhD          0.010413383   0.005385062          0.006756072      4.651126830
Terminal     0.003424653   0.004307162          0.004127311      3.996871039
S.F.Ratio    0.034579932   0.008941163          0.015922849     16.901749643
perc.alumni  0.023282086   0.002815539          0.008356919      4.969591602
Expend        0.02378523    0.01147495           0.01485984      10.16450065
Grad.Rate     0.01425191    0.00554695           0.00787915       6.75926369

Predictions

Now use your random forest model to predict on your test set!
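A sketch of predicting on the held-out set and tabulating the result. For a classification forest, predict() returns class labels by default, so no probability conversion is needed. The seed and names are illustrative, and your counts will differ from those shown if your split differs.

```r
library(ISLR)
library(caTools)
library(randomForest)

df <- College
set.seed(101)  # arbitrary seed
split_flag <- sample.split(df$Private, SplitRatio = 0.70)
train <- subset(df, split_flag)
test  <- subset(df, !split_flag)
model <- randomForest(Private ~ ., data = train, importance = TRUE)

# Predicted labels for the test set (rows = predicted, columns = actual)
p <- predict(model, test)
tab <- table(p, test$Private)
tab
```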

In [51]:
In [52]:
Out[52]:
     
p      No Yes
  No   57   5
  Yes   7 164

The random forest should perform better than a single tree. How much better depends on whether you treat recall, precision, or accuracy as the most important measure of the model.
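To make the comparison concrete, the sketch below computes all three measures from the two confusion matrices printed above. It assumes rows are predictions and columns are actual labels in both tables (the tree's table is unlabeled, so that orientation is an assumption) and treats "Yes" as the positive class. The helper name metrics is hypothetical.

```r
# Confusion matrices copied from the outputs above
tree_cm <- matrix(c(57, 9,
                    7, 160), nrow = 2, byrow = TRUE,
                  dimnames = list(predicted = c("No", "Yes"),
                                  actual    = c("No", "Yes")))
rf_cm <- matrix(c(57, 5,
                  7, 164), nrow = 2, byrow = TRUE,
                dimnames = list(predicted = c("No", "Yes"),
                                actual    = c("No", "Yes")))

# Accuracy, precision, and recall for the chosen positive class
metrics <- function(cm, positive = "Yes") {
  tp <- cm[positive, positive]
  c(accuracy  = sum(diag(cm)) / sum(cm),
    precision = tp / sum(cm[positive, ]),   # TP / all predicted positive
    recall    = tp / sum(cm[, positive]))   # TP / all actual positive
}

round(rbind(tree = metrics(tree_cm), forest = metrics(rf_cm)), 3)
```

Under these assumptions, accuracy rises from about 0.931 to 0.948 and recall from about 0.947 to 0.970, while precision is nearly unchanged (about 0.958 versus 0.959).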

Great Job!